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Abstract 

We have implemented a parallel version of the 
Barnes-Hut 3-D N-body tree algorithm under PVM 
3.2.5, adopting an SPMD paradigm. We parallelize 
the problem by decomposing the physical domain by 
means of the Orthogonal Recursive Bisection oct- 
tree scheme suggested by Salmon (1991), but we mod- 
ify the original hypercube communication pattern into 
an incomplete hypercube, which is more suitable for a 
generic inhomogenous cluster architecture. 
We address dynamical load balancing by assigning dif- 
ferent "weights" to the spawned tasks according to the 
dynamically changing workloads of each task. The 
weights are determined by monitoring the local plat- 
forms where the tasks are running and estimating the 
performance of each task. The monitoring scheme is 
flexible and allows us to address at the same time clus- 
ter and intrinsic sources of load imbalance. We then 
show measurements of the performance of our code on 
a test case of astrophysical interest in order to test the 
performance of our load-balancing scheme. 



1 Introduction 

Numerical simulations of gravitationally interacting 
particles have become one of the most powerful tool 
of contemporary cosmology. Objects ranging in size 
from stellar globular clusters up to clusters of galax- 
ies, and including elliptical and spiral galaxies, can 
be regarded as made of a large collection of point- 
like particles interacting through the Newton's law of 
gravitation. The Gravitational N-body Problem 
aims at finding a description of the dynamics of such 
systems of N gravitationally interacting particles by 
solving directly their equations of motion: 
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In the above equations G denotes Newton's con- 
stant, nii is the mass of i-th particle and the index i 
runs form 1 to N. 

These equations describe the dynamics of a system 
of point-like particles interacting only through their 
mutual gravitational force. Although stars within a 
globular cluster or a galaxy within a galaxy cluster 
are extended objects, they can be considered point- 
like as far as the density and the typical velocity are 
enough small as to make the probability of close in- 
teractions (which could destroy individual objects and 
then create new objects within the system) very small. 
Back-of-the-envelope calculations show that this con- 
dition is fulfilled at least for typical galaxies and clus- 
ters of galaxies, so that eq.( provides a reasonable 
description of the system. The gravitational N-body 
problem provides then a good model of these systems, 
although it does not take into account the presence 
of gas, which is observed within galaxies and clusters 
of galaxies. However, one can show that the gas can 
significantly affect the dynamics of galaxies only after 
they have formed, and has little influence in determin- 
ing the global observed properties of clusters of galax- 
ies. Under many respects, eq.( [l]) provides an accurate 
description of the objects which trace the Large Scale 
Structure of the Universe. This is even more believable 
when one considers that, according to modern cosmol- 
ogy, 90 % of the total gravitating mass of the Universe 
should be made of particles which do not emit an ap- 
preciable amount of radiation at any wavelength and 
interact with the visible matter which makes up stars 
and galaxies only through their gravitational action. 
This Dark Matter, if it exists, dominates the mass 
content of the Universe, and its dynamics is exactly 



described by eq.( |l|). 

System of equations similar to eq. ( |l|) are encountered 
in other fields of physics, e.g. in plasma dynamics and 
in molecular dynamics. However, in the latter case 
the interaction force decays usually faster than r~ 2 
with distance, so that one commits a small error in 
considering the interaction as a local one. In plasma 
dynamics the product mirrij in eq.( [I]) is replaced by 
the product of the charges qiqj, and in stationary neu- 
tral plasma the repulsive and attractive interaction 
balance in such a way that on scales larger than a 
"Debye radius" the plasma can be considered as neu- 
tral. This fact simplifies enormously the calculations: 
in fact one needs to consider only the interactions of 
a particle with those particles lying approximaztely 
within a Debye radius in order to give a detailed ac- 
count of the collective dynamics of a plasma. This 
however does not hold for the gravitational force: be- 
ing always attractive, and decaying not enough fast 
with distance, the gravitational interaction does not 
allow any "screening" . The analogous of the Debye ra- 
dius for gravitating system does not exist: the contri- 
bution to the gravitational force from a large collection 
of far bodies is on average quantitatively as important 
as that coming from random enconuters with nearby 
bodies. These facts all influence the choice of the cor- 
rect integration methods of the above equations. 
The use of N-body computer codes to find approxi- 
mate numerical solutions of the eqs. ( [l]) has become 
a primary tool of the theoretical research in cosmology. 
The simplest and oldest algorithm was based on a di- 
rect solution of the equations for each body (0). One 
sees immediately that the computer work will increase 
as 0(N 2 ) in this method, a cost which makes pro- 
hibitive simulations of more than about 10.000 parti- 
cles even on the largest present-day parallel machines. 
This difficulty prompted for the search of alterna- 
tive algorithms which could circumvent this problem. 
Eq.( [l]) can be rewritten in terms of multipole expan- 
sion as: 
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In this formula M is the total mass of the system, and 
a subscript cm means that the given quantity is com- 
puted at the center of mass of the system. The second 
term on the right-hand side is the quadrupole term, a 
set of 6 quantities obtained from integrals over the sys- 
tem, and we have omitted to report the higher order 
terms. The summation over all the particles in eq.( ^) 
has now been replaced by a summation over the mul- 
tipole expansion. Although these latter are infinite in 



number, their magnitude is a fast decreasing function 
of distance from the center of mass, so we commit a 
little error by omitting the higher order terms in this 
approximation. 

The use of eq. ( |^) to compute the force is at the heart 
of the Particle-Mesh (PM) and (Particle-Particle)- 
(Particle-Mesh) (P 3 M) numerical approaches (sec || 
for a comprehensive treatment). In the PM mehod the 
gravitational potential and force are calculated from 
a Fast Fourier Transform of the density distribution, 
and the force is computed at the corners of a grid 
superimposed on the system. The spatial resolution 
is rigidly fixed by the size of the mesh, and cannot 
be changed during the simulation. The approxima- 
tion then becomes a poor one as the system evolves 
toward an inhomogeneous situation, as it often hap- 
pens in cosmological simulations in which clusters of 
galaxies form out of an almost uniform initial state. 
In the P 3 M the interaction with the nearest neigh- 
bouring particles is computed exactly, but otherwyse 
the computation scheme is that of a PM. In both 
these schemes the computational cost grows at most as 
0(N log N), but the grids are fixed and the final spa- 
tial resolution depends on their size, i.e. on the num- 
ber of particles adopted. Another multipole expansion 
scheme was devised by Greengard and Rokhlin || . 
An adaptive algorithm class of multipole expansion 
methods is based on the tree decomposition to rep- 
resent the structure of the gravitational interactions. 
Our code is based on this algorithm, and in the next 
section we will look at it in more detail. 

2 Tree Methods. 

In the Barnes-Hut tree method the space domain 
containing the system is divided into a set of cubic 
cells by means of an oct-tree decomposition: start- 
ing from a root cell containing all the particles each 
cell is further subdivided into 8 cells, until the last 
cells contain only 1 or bodies. This structure is the 
tree. For those cells of the tree containing more than 
one body one stores the position, size, total mass and 
quadrupole moment in corresponding arrays. Cells 
containing only one body, on the other hand, store 
only the position of the body. To compute the force 
on a given particle, one inspects the tree, i.e. compare 
the particle's position with the distance and position 
of each cell of the tree. When the distance r of the 
particle from the edge of the given cell is such that the 
Cell Opening Criterion ( COC) is verified, i.e. when: 
r/d < 9, where d is the size of the edge of the cell and 
9 is a fixed parameter, the cell is "opened", i.e. one 



inspects the cells which are "daughters" of the current 
cell. Cells containing only one body and cells which 
do not fulfill the COC are considered for interaction: 
their monopole and quadrupole moments (if they have 
one) are added up to the total gravitational force felt 
by the body. In this way, the interaction with near- 
est single particles is computed exactly, while groups 
of particles which are far enough treated as extended 
objects characterized by a monopole and a quadrupole 
interaction. 

The Barnes-Hut scheme has been implemented in var- 
ious numerical N-body codes. It is adaptive (the tree 
is reconstructed after each time step) and through the 
parameter 8 allows a control on the accuracy of the 
force calculation, ranging from the case 8 = (corre- 
sponding to direct interactions) to larger values. The 
parallel PVM N-body code which we have developed is 
based on the FORTRAN version of the vectorial code 
written by dr. L. Hernquist Q, who kindly provided 
us with a copy of his latest version. However, as we 
will show later, the communication structure of our 
CAPANIC code bears little resemblance with that of 
the original Hernquist's code. 

A parallel implementation of the Barnes-Hut algo- 
rithm was made by Salmon @, ||, ||, @, @. 
It was devised to run on massively parallel systems 
like the CM-5 and the Touchstone Delta, and the par- 
allelization was done exploiting features of the FOR- 
TRAN compilers on these machines. Another imple- 
mentation on a dedicated, transputer -based machine, 
(GRAPE 1-A) was made by Makino and coll. g, gj. 
Although the results are very encouraging, these codes 
are difficul to export on platforms different from those 
for which they were originally devised. Moreover, they 
are devised for homogeneous clusters of processors, 
where all the processors have the same characteris- 
tics, although some of them account for dynamical 
load balancing among different processors. 
The strategy behind our parallel N-body code is com- 
plementary to that of the above quoted papers. We 
desired to produce a public software which could be 
easily implemented on different platforms, and partic- 
ularly on already existing clusters of workstations. It 
was then necessary to perform the parallelization by 
adopting products which are (or start to be viewed as) 
standards, and which can be readily obtained by ev- 
erybody. This motivated our choice of making use 
of the Parallel Virtual Machine (PVM ) software, 
which is now emerging as a standard and is particu- 
larly suited to implement parallel algorithms on clus- 
ters of heterogeneous workstations. 
When one writes a parallel application one has to 



make a choice between two possible schemes: a "mas- 
ter/slave" and a "Single Program Multiple Data" 
(SPMD) one. In the latter a single program is 
"cloned" within the tasks spawned by the initial ap- 
plication, and the cloned parts control the commu- 
nication of data among the different processes. We 
adopted this latter scheme because all the processors 
are treated on an equal foot, while in the master/slave 
scheme the processor which hosts the master process 
does a different job, and generally it spends a long 
time waiting for the data to come back from the slave 
processes. On the other hand, an SPMD scheme, al- 
though is generally more difncul to implement, is more 
suited for dynamical load balancing, one of the crucial 
issues addressed in our work. 

The result of this effort is CAPANIC (CAtania 
PArallcl N-body Code for Inhomogeneous Clusters 
of Workstations), which described in the following sec- 
tions. 



3 PVM and parallelism. 

3.1 Overview. 

CAPANIC has been devised to work under PVM 
(Parallel Virtual Machine), a package freely dis- 
tributed by the Oak Ridge National Laboratory, Ten- 
nessee. We run it on a cluster comprising a Convex 
C210 and various Sun workstations at our Institute. 

3.2 PVM short description. 

PVM is based on messages passing between het- 
erogeneous hosts on the network. It creates on each 
host belonging to the virtual machine a daemon which 
is able to send/receive messages and data packets 
to/from other daemons loaded on other hosts. In this 
way a collection of different processors can be seen by 
a specified job as a large, single virtual machine. 
PVM automatically start up tasks on the hosts and 
provide a suitable set of subroutines for message pass- 
ing in order to allow the tasks to communicate and 
synchronize with each other. Applications written in 
Fortran77 and C can be parallelized using the PVM 
message passing contructs, so that several communi- 
cating tasks can cooperate to solve the problem. 
Each process enrolling in PVM is assigned an inte- 
ger task identifier (tid) that unambigously identifies 
it. The tids are unique across the virtual machine and 
are provided by the PVM daemons. 
PVM supplies routines for packing and sending mes- 
sages and data among tasks on the virtual machine. 



Each message of the transmitting task is packed on a 
send-buffer on the host and sent to the receive-buffer 
of the receiving tasks. The communication model pro- 
vides asynchronous blocking send functions that re- 
turns when the send buffer is free for reuse (recep- 
tion is completed on the receive buffer of the receiving 
tasks); asynchronous blocking receive function that re- 
turns when the data are received in the buffer; and 
asynchronous non-blocking receive function that re- 
turns with either the data or a flag that data has not 
arrived. 

PVM supports point-to-point communication and 
multicast to a set of taks and broadcast to a user 
defined group of tasks. Message buffer are allocated 
dinamically, so that the maximum size messages that 
can be sent or received depends only on the available 
memory of the hosts. 



4 CAPANIC, inhomogeneous clusters 
and load balancing 

The CAPANIC software has been devised to run 
on an inhomogeneous cluster of hosts forming the vir- 
tual machine. At our site the hosts are used as general 
purpose computers and their local load can be strongly 
variable with time. 

At the beginning the code divides the physical domain 
occupied by the system into an incomplete hypercube, 
and assigns the bodies contained within M regions of 
the hypercube to an equal amount of tasks which are 
spawned on the hosts of the virtual machine. In order 
to avoid "load imbalance" the domain decomposition 
is performed taking into account the fact that the to- 
tal workload on each task depends on two different 
types of parameters. Parameters of the first type, Ai 
(i=l..M), depend on the characteristics of the hosts 
where the tasks will be spawned; those of the second 
type, Bj (j=l..N), depend on the number of interac- 
tions that are necessary to calculate the force on the 
body. The Ai parameters are evaluated from the sta- 
tistical average load and the characteristics of the host 
, and increase with the average performances associ- 
ated with the host. The Bj parameter are evaluated 
using information from previous runs, and account for 
the intrinsic work done by the spawned tasks to ad- 
vance in time the positions and velocities of the parti- 
cles which have been assigned to it. This work depends 
strongly on the structure and depth of the tree: ulti- 
mately on the geometry and mass distribution of the 
particles within the system. At the beginning we put 
Bj = 1 for all the particles. 



4.1 The domain decomposition 

Following Salmon JO] we use an orthogonal recur- 
sive bisection (ORB) scheme to partition the entire 
domain in M subdomains and to assign the particles 
of each partition to a task. Each cutting plane (bisec- 
tor) of the partition splits the domain into two sub- 
domains to which a set of processors is assigned. The 
domain decomposition proceeds until only one pro- 
cessor is assigned to each subdomain. The position of 
each bisector is determined in such a way as to have 
the same workload in each of the subdomains. 
Let us introduce the function W(x) defined as the ra- 
tio of the works associated with the subdomain and 
the work associated with the parent domain: 

. . Work (subdomain) , , 

W ( X ) = TT7 , / 1 3 

W ork (parent domain) 

where the Work function is evaluated as the following 
ratio 

W or k (region) = 3 - (4) 

where j = 1, Nbodies runs over the bodies of the 
region, and i = 1, N proc is the total number of pro- 
cessors assigned to the region. The position of the bi- 
sector x sp ut is choosen so that W(x sp nt) = 0.5. After 
the domain decomposition, the tasks are spawned on 
the virtual machine, and the properties of the bodies 
of the subdomain assigned to each task are delivered, 
so that each task contain only the right amount of in- 
formation needed to compute the forces on the bodies 
assigned to it. 



5 Results. 

The virtual machine we have used to develop and 
test our application is formed by the following hosts: 
one Convex C210 machine having 64 MB Ram, one 
Sun Sparc 10 (first generation) having 32 MB Ram, 
one Sun Sparc 10 (last generation) having 32 MB 
Ram; three Sun Sparc 2 having 16 MB Ram, one Sun 
Sparclassic having 16 MB Ram. Runs were realized in 
order to compare the efficiency of the CAPANIC code 
on the virtual machine, in comparison with the vecto- 
rial code of Hernquist. Several tests were realized for 
a system of 8000 bodies in a configuration evolving 
slowly for about 20 time-step. It is useful to distigu- 
ish 4 phases at each time-step: 1- local tree formation 
(computational phase); 2 - send/receive trees (com- 
munication phase); 3 - locally essential tree formation, 



force evaluation on each local body, update bodies pro- 

prierties (computational phase); 4 - body migration 

and synchronization (communication phase). 

We reports only the results of the most significant 

tests. 

5.1 First Test 

Here we run the serial Hernquist's code on CON- 
VEX machine with, on average, 40% CPU, so that the 
time-step duration was about 55 sec. Using a full ded- 
icated Sun Sparc 2 the time-step duration was about 
172 sec. 

5.2 Second Test 

Using CAPANIC we performed three runs using 2, 
4 and 8 tasks spawned on the CONVEX, with 4000, 
2000 and 1000 bodies for each task, respectively. The 
following figure show the results. 



In the B phase, the tasks send and receive the infor- 
mation from each other task of the application. For 
each bisector of the splitted domain the cells of the lo- 
cal tree in each task are checked: those which do not 
satisfy a Domain Opening criterion [DOC] (Jl^j) are 
enqueued for sending, while those which satisfy the 
DOC are opened and their daughters are inspected. 
Then the task sends the cell properties to the proces- 
sor set on the other side of the bisectors. After this 
step, it receives cell properties from all the tasks of 
the application. During this phase the computational 
time is negligible and we can consider this time to be 
dominated by the time of the communication phase. 
In the C phase, using the received cells, the task builds 
the locally essential tree (@). For each local body, 
traversing the locally essential tree, the force acting 
on it is evaluated and body properties are updated. 
The computational time spent in this phase depends 
on the same factors of the A phase. 
In the D phase, local bodies that are out of the spatial 
region assigned to the task, are deleted from the list 
of local bodies and delivered to the tasks that have 
the spatial region including the new bodies position 
(body migration phase). At the same time a task re- 
ceives bodies from the other task. During this phase 
the computational time is negligible and we can con- 
sider this time to be dominated by the communication 
phase. 

We can note that the communication and synchroniza- 
tion phases (B and D phases) increase from 2.3% of 
the total time-step (2 tasks) to 38.6% (8 tasks). This 
is essentially due to the load of the PVM daemon that 
serves all the pvm calls, whereas the time for the com- 
putational phase decrease from 97.7% to 61.4% of the 
total time step. The time step duration decrease from 
35.03 sec to about 40 sec as the number of spawned 
tasks on the same host increases. 



5.3 Third Test 



As reported in the figure, we can distinguish four 
phases. In the A phase, using only the local bodies, 
the tasks form a local tree. The computational time 
spent in this phase depends on two factors: the num- 
ber of bodies assigned to each task, which decreases as 
the number of spawned tasks increase, and the total 
CPU time, that increase as the number of spawned 
task increase, although the CPU time for each task 
decrease from 35% (2 spawned task) to 11% (8 tasks). 



We have spawned 2 tasks: 1 task on Convex C210 
host (using 45% CPU time) with 6300 bodies and 1 
task on Sun Sparc 2 host (using 90% CPU time) with 
1700 bodies. We have verified that with this choice the 
workload is approximately the same in both tasks, and 



we obtain the results shown in the following figure: bodies and the Sun Sparc 10 (last generation) 1640 

bodies. Using this domain distribution the load was 
alomost equally balanced between the tasks, and we 
obtain the following results: 



We can note that this test produces results similar 
to the previous one using 2 task spawned on the Con- 
vex. The time spent in the B and D phases T(comm) 
depends on three factors: T(l), the latency time due 
to the PVM software, T(n) the network transfer time 
and T(s), the waiting time for synchronization. We 
have: 

T(comm) = T(l) + T(n) + T(s) (5) 

During the run the total network capability was used 
for the communication phase. For the task spawned 
on the Convex the following results were obtained: 

T(l)=T(n); T(l)=0.17 sec; T(s) was negligible 

For the task spawned on Sun Sparc 10 host the fol- 
lowing data were evaluated: 

T(l) = T{n):T{l) = 0.17sec;T(s) = 8.9*T(n) « l.blsec 

(6) 

5.4 Fourth Test 

We have spawned 7 tasks: 1 task on Convex ma- 
chine using 50% CPU time with 2462 bodies; 1 task 
for each Sun Station of the virtual machine (full ded- 
icated to run the tasks). The Sun Sparclassic and the 
Sun Sparc2 machines with 667 bodies (667 x 4 = 2668 
bodies); the Sun Sparc 10 (first generation) with 1230 



It is interesting to observe that, in comparison with 
the second test using 8 task on Convex, the commu- 
nication and synchro phase decrease from 15.3 sec. to 
6.3 sec (considering the slowest machines), and the 
following results for the speed-up S were obtained: 

T(serial code on Convex) ^ 

1 T(7 spawned tasks on the virtual machine) 

(7) 

and 

T(serial code on SunSparc 2) 

2 T(7 spawned tasks on the virtual machine) 

(8) 

where T(machine) is the time step duration. 
In the CAPANIC application T(machine) can be con- 
sidered as the sum of the computational time T(comp) 
and the communication and synchronization phase 
T (coram). Depending on the host were the task was 
running T(comm) = T(l) + T(n) + T(s) we derive the 
following values: 

Sun Sparcstation 2: T(s) negligible; 
T(n)= 2*T(1) = 4.2 sec. 

Sun Sparcstation 10: T(s) =5.2 sec; 
T(n)= 2*T(1) = 4.2 sec. 



Convex C210: T(s)= 7 sec; 
T(n)= 2*T(1) = 4.2 sec. 



6 Summary and Future Prospects. 

The results presented in the preceding section 
demonstrate the efficiency of the PVM package in par- 
allelizing an highly adaptive, dynamically changing al- 
ogorithm like the Barnes-Hut tree application. Our 
tests show that the total speed-up rises to approxi- 
mately 6.2 for our inhomogcncous cluster. It is true 
that we evolved our system only for 20 time steps, so 
our system had not enough time to become very in- 
homogeneous, but it has been observed that the load 
does not depend very much on the degree of inhomo- 
geneity of the system (|L(|). Probably the total num- 
ber of particles is a more crucial parameter, and we 
plan to test CAPANIC for initial configurations hav- 
ing N > 8000. 

We are now planning to extend the original tree al- 
gorithm to include the gas. As we said in the Intro- 
duction, this component does not affect significantly 
the dynamics on scales larger than those of individua 
galaxies. However it is the component which emits 
most of the visible light in the Universe, so it is im- 
portant to introduce it within a code designed for cos- 
mological simulations. An obvious choice would be to 
extend our CAPANIC code by "merging" it with an 
SPH code, but recently Motta and Wick (0) have 
proposed a particle approximation scheme to solve 
numerically the Fokker-Planck equation which is far 
more accurate and simple to implement than the SPH. 
At variance with the SPH the Motta- Wick algorithm is 
based on an approximation to an exact solution of the 
kinetic equations, where the fluid elements are treated 
as gas particles having a dynamics specified by a set 
of gravitational equations and fluid equations. The 
main problem lies in the fact that the introduction of 
a second class of particles brings cell-cell communica- 
tions into the game, a feature which complicates the 
communications. We will report on this attempt in 
another paper. 
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